The Test of English as a Foreign Language (TOEFL) was developed in 1963 by a
National Council on the Testing of English as a Foreign Language, which was
formed through the cooperative effort of over thirty organizations, public and
private, that were concerned with testing the English proficiency of nonnative
speakers of the language applying for admission to institutions in the United States.
In 1965, Educational Testing Service (ETS) and the College Board assumed joint
responsibility for the program, and in 1973 a cooperative arrangement for the
operation of the program was entered into by ETS, the College Board, and the
Graduate Record Examinations (GRE) Board. The membership of the College Board
is composed of schools, colleges, school systems, and educational associations;
GRE Board members are associated with graduate education.

ETS administers the TOEFL program under the general direction of a Policy
Council that was established by, and is affiliated with, the sponsoring
organizations. Members of the Policy Council represent the College Board and
the GRE Board and such institutions and agencies as graduate schools of
business, junior and community colleges, nonprofit educational exchange
agencies, and agencies of the United States government.
tion of the TOEFL Research Committee. Its six members include representatives of
the Policy Council, the TOEFL Committee of Examiners, and distinguished
English-as-a-second-language specialists from the academic community. Currently the
research and to set guidelines for the entire scope of the TOEFL research program.
Members of the Research Committee serve three-year terms at the invitation of the
Policy Council; the chair of the committee serves on the Policy Council.

Because the studies are specific to the test and the testing program, most of the
actual research is conducted by ETS staff rather than by outside researchers.
However, many projects require the cooperation of other institutions, particularly
those with programs in the teaching of English as a foreign or second language.
Representatives of such programs who are interested in participating in or
conducting TOEFL-related research are invited to contact the TOEFL program office.
Local research may sometimes require access to TOEFL data. In such cases, the
program may provide this data following approval by the Research Committee. All
TOEFL research projects must undergo appropriate ETS review to ascertain that
the confidentiality of data will be protected.
Abstract



The investigation was undertaken to provide information about the 
feasibility of reducing scoring costs by using one rater instead of the two 
that are now used for the TSE. It was concluded that because of the 
possibility of different standards among potential raters, it does not 
appear feasible to use a single rater as the sole determiner of speaking 
proficiency under the current system. Other possible alternatives to a 
single rating, relying on psychometric methodology and technology, are
discussed. The approach was to first examine the possibility of developing 
a "quality control" index that would predict the extent of the disagreement 
between two raters. The index that was developed for this purpose could 
not be validated. It was found that the best predictors of rater 
disagreement were the identities of the raters. The disagreements,
however, resulted from the differing standards used by different raters. 
That is, raters agree substantially about the ordering of examinees but 
vary slightly in the severity of their ratings. 
There is a growing consensus that speaking proficiency is best measured
by evaluating directly the individual's speaking skills (Powers and 
Stansfield, 1983). The Interagency Language Round Table Oral Proficiency 
Interview is a well-known procedure that exemplifies this approach; the 
Test of Spoken English (TSE) developed at Educational Testing Service (ETS) 
is another. An important feature of each of these measures is that the 
score is derived solely from the ratings provided by language specialists. 
This means that a contingent of trained raters must be available to the 
program. One characteristic of testing programs that rely on raters is the 
high cost of "scoring" the tests. The proportion of the total budget that
rating costs represent naturally decreases only slightly as volume
increases but remains quite high since the cost of rating each examinee
remains constant. By contrast, in a testing program that uses "objective"
assessment procedures, the cost associated with scoring declines sharply as
volume increases. This no doubt explains to some extent the preponderance
of objective assessment procedures. Nevertheless, in domains such as
speaking proficiency, ratings are the most appropriate and feasible
measurement approach. Thus, it is important that cost-effective procedures
be found to obtain valid measurement.

Overview of the Study 

This investigation sought to provide information that could guide a 
decision on ways of reducing TSE scoring costs. The approach taken was to 
investigate the possibility of using one rater instead of the two currently 
used. Some previous research (Bolus, Hinofotis, and Bailey, 1982) has 
demonstrated that a single rater can in fact yield sufficiently adequate
measures of proficiency. If this proved to be the case for the TSE, it 
should be possible to significantly reduce the costs of the program. 

Because the records for the program have not been "computerized," our
first task was to create a data base. (Care was taken to document the data
base carefully since it will facilitate future analysis for either
administrative or research purposes.) We then examined the measurement
characteristics of the existing rating procedures, focusing on the
possibility that disagreement among raters could be predicted
statistically. Our rationale for this approach was that even if a single
rater proved sufficiently accurate, we would still need a quality-control
mechanism to identify instances in which there would have been a large
discrepancy if another rater were involved. We then focused on the
characteristics of individual raters to determine whether raters tend to
apply similar standards.
Description of the Test

The TSE consists of seven sections designed to elicit different speech 
acts. The first section is a warm-up and is not scored. The composition 
of the test is presented in Table 1. 

Table 1

Contents of the TSE

Section   Description

   1      Warm-up consisting of questions concerning examinee
          background characteristics

   2      Examinee is given a passage to read aloud.

   3      Examinee completes 10 partial sentences in a meaningful
          manner.

   4      Examinee tells a story about a series of related pictures.

   5      Examinee responds to a series of questions posed by the
          examiner concerning a drawing.

   6      Examinee is expected to provide lengthy responses about
          topics with which he or she is familiar.

   7      Examinee sees a printed schedule and describes it aloud.



The test can be administered on an individual basis or to groups using 
a language laboratory. The test questions and response stimuli appear in 
the printed test book or are heard by the examinee on the test tape. 
Examinee responses are recorded on a separate tape that is sent to ETS. 

Scoring Procedures 

Performance on the TSE is evaluated by two raters, both randomly 
assigned from a pool of about twenty raters who have a background in 
language teaching and testing and have attended a one-day rater training 
workshop at ETS. Raters evaluate examinees' performance along four 
linguistic dimensions. Three of these (grammar, fluency, and pronunciation)
are considered diagnostic scores; the fourth dimension, comprehensibility,
is considered to be integrative.

To facilitate discussion of the different variables, we will adopt the
following convention: the first and second rating of each examinee will be
denoted as rating A and rating B, respectively. (It should be pointed out
that these are arbitrary labels and that each rater has an equal chance of
contributing a B or an A rating. That is, ratings A and B should be viewed
for practical purposes as replicates of each other.) The four linguistic
dimensions will be denoted by "Pron" for pronunciation, "Gram" for grammar,
"Flu" for fluency, and "Comp" for comprehensibility. Finally, sections 2
through 7 will be denoted by the corresponding numerals.

Each examinee obtains two sets of ratings. To refer to scores obtained
under either the first or the second rating, the variable name is preceded
with either A or B. The number of sections and items composing each
dimension is shown in Table 2. The item score in each case ranges from 0
to 3; that is, each of the linguistic dimensions is rated on a
four-category scale. For sections that contain more than one item, the
mean rating across items is the score for that section. For example,
Section 3 consists of ten items, each rated for grammar and
comprehensibility. The scores on the ten items are averaged, and the mean
score becomes the score for the section.

An overall score on each of the four dimensions is obtained by
averaging across the section scores. For example, the overall score for
grammar is the average of Gram3 and Gram5. The result is a set of four
overall scores for each examinee from each rater. For score reporting
purposes, the two sets are averaged; the average comprehensibility score is
considered the score for reporting purposes.
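
The aggregation rules just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the program's scoring software: the function names and the example item ratings are hypothetical, and the section-to-dimension mapping is taken from Table 2.

```python
from statistics import mean

# Sections contributing to each dimension, as laid out in Table 2.
SECTIONS = {
    "Pron": [2, 4, 5, 6, 7],
    "Gram": [3, 5],
    "Flu": [2, 4, 5, 6, 7],
    "Comp": [2, 3, 4, 5, 6, 7],
}


def section_score(item_ratings):
    """Mean of the 0-3 item ratings within one section."""
    return mean(item_ratings)


def overall_score(dimension, items_by_section):
    """Average of the section scores that contribute to one dimension."""
    return mean(section_score(items_by_section[s]) for s in SECTIONS[dimension])


def reported_score(comp_a, comp_b):
    """Reported score: average of the two raters' overall comprehensibility."""
    return mean([comp_a, comp_b])


# Hypothetical grammar item ratings from one rater for one examinee.
grammar_items = {
    3: [2, 2, 3, 2, 2, 3, 2, 2, 2, 2],  # ten sentence-completion items
    5: [2, 3, 2, 3],                    # four questions about a drawing
}
gram = overall_score("Gram", grammar_items)  # average of Gram3 and Gram5
```

Averaging section means, rather than pooling all items, gives each section equal weight regardless of how many items it contains.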



Table 2

Sections that Contribute to TSE Scores and
Number of Items per Section

                                                          Number of
Section   Pronunciation   Grammar   Fluency   Comprehensibility   items

   2         Pron2                   Flu2          Comp2             1
   3                       Gram3                   Comp3            10
   4         Pron4                   Flu4          Comp4             1
   5         Pron5         Gram5     Flu5          Comp5             4
   6         Pron6                   Flu6          Comp6             3
   7         Pron7                   Flu7          Comp7             1

Overall      Pron          Gram      Flu           Comp

If the two raters differ by more than .95 at the overall score level on
any linguistic dimension, a third rater is brought in. The third rating is
averaged with the other two.
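
A minimal sketch of this adjudication rule follows. The function names and the example ratings are hypothetical, and the sketch assumes a strict "more than .95" comparison, as stated above.

```python
THRESHOLD = 0.95  # discrepancy on an overall score that triggers a third rating


def needs_third_rating(overall_a, overall_b, threshold=THRESHOLD):
    """True if the two raters differ by more than the threshold on any dimension."""
    return any(abs(overall_a[d] - overall_b[d]) > threshold for d in overall_a)


def final_scores(*ratings):
    """Average the overall scores across however many ratings were collected."""
    return {d: sum(r[d] for r in ratings) / len(ratings) for d in ratings[0]}


# Hypothetical overall scores for one examinee.
rating_a = {"Pron": 2.0, "Gram": 2.5, "Flu": 1.2, "Comp": 2.0}
rating_b = {"Pron": 2.2, "Gram": 2.4, "Flu": 2.3, "Comp": 2.1}
rating_c = {"Pron": 2.1, "Gram": 2.4, "Flu": 1.9, "Comp": 2.0}

if needs_third_rating(rating_a, rating_b):  # fluency differs by 1.1
    scores = final_scores(rating_a, rating_b, rating_c)
else:
    scores = final_scores(rating_a, rating_b)
```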

Description of the Data Base 

The data base used in the study consisted of all available protocols
from November 1981, the first official administration of the TSE, to June
1983. Altogether there were 560 examinees in the data base, each rated by at
least two raters. Table 3 shows some descriptive statistics for each of the
scores under ratings A and B.



Table 3

Mean, Standard Deviation, Median, and Interquartile
Range (IQR) of Ratings A and B (N = 560)

                Rating A                      Rating B
        Pron   Flu    Gram   Comp     Pron   Flu    Gram   Comp

Mean    1.98   2.12   2.32   2.11     2.00   2.14   2.30   2.14
S.D.     .60    .55    .50    .53      .62    .57    .51    .54
Med     2.00   2.05   2.40   2.07     2.00   2.07   2.38   2.10
IQR      .80    .70    .65    .62      .88    .74    .68    .70




Figures 1-4 show the distribution for rating A and rating B on each 
score. It is apparent that the distributions are quite similar, as would 
be expected given the fact that whether a rater provides a rating A or B is 
determined basically at random. 

Differences Between Raters 

For the purposes of this investigation it is important to characterize 
the differences among the ratings since, as indicated above, it is those 
differences that determine whether a third rater is used. Table 4 shows 
descriptive statistics on the differences for the four scores. The 
distribution of the differences tends toward normality, with the mean and 
median close to zero in all cases. However, the variability of the 
differences for pronunciation and fluency scores is markedly greater than 
the variability of the differences for grammar and comprehensibility. 

These findings are reassuring. When differences among the raters are
pure error, the distribution is precisely normal with a mean of zero.
Since the means are very near zero, one can assume the differences are not
systematic.
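
The kind of summary reported for the rater differences can be sketched as follows. The data are hypothetical (five examinees rather than 560), and the quartile convention used here is one of several in common use; the report does not say which convention was applied.

```python
from statistics import mean, median, quantiles, stdev


def summarize_differences(ratings_a, ratings_b):
    """Descriptive statistics of the A minus B differences on one score."""
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    return {
        "mean": mean(diffs),
        "sd": stdev(diffs),
        "median": median(diffs),
        "iqr": q3 - q1,  # range between the 25th and 75th percentiles
    }


# Hypothetical overall scores on one dimension under ratings A and B.
scores_a = [2.0, 1.5, 2.5, 3.0, 1.0]
scores_b = [2.0, 2.0, 2.0, 2.5, 1.5]
summary = summarize_differences(scores_a, scores_b)
```

A mean and median near zero, as in this toy example, indicate that neither label (A or B) is systematically higher; the spread shows how large individual disagreements can be.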
[Figure: Grammar Ratings Assigned to 560 Examinees Under Ratings A and B]

[Figure: Pronunciation Ratings Assigned to 560 Examinees Under Ratings A and B]

[Figure: Fluency Ratings Assigned to 560 Examinees Under Ratings A and B]

[Figure 4: Comprehensibility Ratings Assigned to 560 Examinees Under Ratings A and B]

Table 4

Descriptive Statistics of Differences Between
A and B Ratings on Four Linguistic Skills (N = 560)

Statistics          Pron     Flu      Gram     Comp

Mean                -.02     -.02      .02     -.03
Standard Dev.        .44      .45      .30      .38
Med                   0        0        0      -.01
IQR                  .60      .52      .35      .45
99th percentile     1.01     1.01      .71      .88
95th percentile      .78      .80      .50      .66
90th percentile      .60      .60      .42      .45
10th percentile     -.60     -.63     -.34     -.52
5th percentile      -.73     -.80     -.48     -.65
1st percentile      -.93    -1.09     -.80     -.91


Note: The interquartile range (IQR) is the range 
between the 25th and 75th percentiles. 



Development of a Quality Control Index 

As indicated earlier, a major objective of this investigation was to 
develop and validate an index that could be used to predict disagreement 
among raters. The fact that up to this point examinees have been rated by 
two raters allows us to validate such an index. However, to be useful, the 
index should work in such a way that it could be computed on the ratings 
provided by a single rater. One hypothesis investigated was that the
language background of an examinee could be implicated in large
disagreements between raters. Examination of the data, however, did not
indicate that the frequency of rater disagreement was associated with
language background. This left two possibilities for predicting
disagreement: the identity of the rater and some rater-by-ratee
interaction. We will first examine the latter possibility.

The rationale of the procedure to detect rater-by-ratee interactions is 
to investigate the underlying statistical model that accounts for the 
covariation among the ratings. Deviations from that underlying model are 
taken to be a possible indication of something unusual about the 
rater-ratee observation. The model that was postulated was a dimensionality 
model. Specifically, it was postulated that the ratings would be
unidimensional, that is, that a single underlying variable would account
for the covariation among the ratings. Such a model provides the simplest
starting point even though it is inconsistent with the current view that
linguistic skills are not based on a unitary factor. (See Oller, 1983;
Vollmer & Sang, 1983.) (It is beyond the scope of this report to discuss
the dimensionality of linguistic skills. As we will see below, a
unidimensional model seems adequate to statistically account for the
covariation among linguistic skills as perceived by raters, but this does
not preclude the possibility that psychologically more than one factor is
needed to account for linguistic performance.)

Dimensionality was investigated by means of factor analysis. Table 5 
shows the correlations among the four scores for the A and B ratings. As 
can be seen, the intercorrelations are nearly identical. 



Table 5

Intercorrelation Among the Four Linguistic Skills
(Rating A Below the Diagonal, Rating B Above)

         Pron     Gram     Flu      Comp

Pron      --      .724     .829     .907
Gram     .726      --      .746     .797
Flu      .821     .746      --      .884
Comp     .903     .798     .877      --



To examine dimensionality as such, we factor analyzed the matrices in
Table 5. It should be noted that in doing so we ignored the covariation
among raters embedded in these matrices. That is, the covariation among
scores could be partitioned into a "between raters" component and a "within
raters" component. For this analysis we implicitly assumed that the
between-raters component was nil and that each rater was, in fact,
unidimensional. However, it was conceivable that by collapsing across
raters we might create unidimensionality artifactually. Thus, the
dimensionality of each rater was also analyzed. The results appear in
Appendix A. It was found that while the fit of a single dimension was not
equally good across raters, a single dimension was the most reasonable
model. This does not guarantee that the same dimension is present in each
rater, but the magnitude of the correlation between the raters points in
that direction, as we shall see in Table 11.

The dimensionality of ratings A and B data was investigated by means of 
maximum likelihood factor analysis. A maximum likelihood estimation 
process is statistically most efficient and provides a measure of 
statistical goodness of fit, provided certain distributional assumptions 
are met. (Computations were performed using the SAS statistical package.) 
A single factor was fitted to each matrix. The results of the factor 
analysis, including a statistical measure of fit, appear in Table 6. 
Table 6

Results of the Maximum Likelihood Factor Analysis
Extracting a Single Factor

               Rating A                  Rating B
                     Mean-sq                   Mean-sq
         Loadings    Residuals     Loadings    Residuals

Pron       .917        .01           .920        .01
Gram       .812        .02           .808        .02
Flu        .894        .01           .899        .01
Comp       .983        .00           .985        .00

Chi-square     7.41                      7.55
df             2                         2
p              .025                      .023




The results strongly suggest that the ratings are in fact unidimensional.
The probability of the null hypothesis of a single factor is small but
cannot be entirely relied upon since the data do not follow a multivariate
normal distribution. More importantly, the root mean squares of the
off-diagonal residuals do not show a pattern. As Table 6 shows, the
magnitude of the residuals is small. In other words, it is possible to
recover the original correlation matrix fairly well with the estimated
loadings on a single factor. It is also worth noting that the results of
the factor analyses for ratings A and B are quite similar. That is, the
largest contributor to the factor is comprehensibility, followed by
pronunciation, fluency, and grammar in that order.
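
The claim that the correlation matrix can be recovered from a single set of loadings can be checked directly: under a one-factor model, the implied correlation between two scores is the product of their loadings. The sketch below uses the Rating A loadings from Table 6 and the Rating A correlations from Table 5; the function name is hypothetical, and the residuals computed here are simple differences, not the mean-square residuals reported in Table 6.

```python
# Rating A loadings on the single factor (Table 6).
LOADINGS_A = {"Pron": 0.917, "Gram": 0.812, "Flu": 0.894, "Comp": 0.983}

# Rating A correlations among the four scores (Table 5, below the diagonal).
OBSERVED_A = {
    ("Pron", "Gram"): 0.726,
    ("Pron", "Flu"): 0.821,
    ("Pron", "Comp"): 0.903,
    ("Gram", "Flu"): 0.746,
    ("Gram", "Comp"): 0.798,
    ("Flu", "Comp"): 0.877,
}


def one_factor_residuals(loadings, observed):
    """Observed correlation minus the product of the two loadings."""
    return {
        pair: r - loadings[pair[0]] * loadings[pair[1]]
        for pair, r in observed.items()
    }


residuals = one_factor_residuals(LOADINGS_A, OBSERVED_A)
largest = max(abs(v) for v in residuals.values())
```

For these values every residual is on the order of .02 or smaller in absolute value, which is what "recovering the matrix fairly well" amounts to.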

Reliability. Having estimated the parameters of a single factor model,
we could estimate the internal consistency of rating A and rating B data. 
This was done by following Maxwell's (1971) formulation for estimating 
reliabilities of composite scores. In the present case, the composite 
consisted of the four scores. When there is a single factor in the data, 
reliability is given by the following formula:

\[
r = \frac{\sum_{i} \lambda_i^2 / (1 - \lambda_i^2)}
         {1 + \sum_{i} \lambda_i^2 / (1 - \lambda_i^2)} \qquad (1)
\]

where \lambda_i is the loading of the ith score on the single factor. The
estimate can be obtained by using the estimates of \lambda_i given in Table 6.
For rating A the estimated internal consistency reliability across four scores
was .976; it was .978 for rating B. These estimates suggest a high degree
of consistency in the ratings but are different from estimates of
interrater reliability.
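
The reliability estimates can be reproduced from the Table 6 loadings. The sketch below assumes the single-factor reliability takes the form r = w / (1 + w), where w is the sum over scores of the squared loading divided by one minus the squared loading; this form is an assumption about the formula intended here, adopted because it reproduces the reported values of .976 and .978. The function name is hypothetical.

```python
def single_factor_reliability(loadings):
    """Reliability of a composite when one factor underlies all scores.

    Assumed form: w / (1 + w), with w = sum(lam**2 / (1 - lam**2)).
    """
    w = sum(lam ** 2 / (1 - lam ** 2) for lam in loadings)
    return w / (1 + w)


# Loadings for Pron, Gram, Flu, Comp from Table 6.
LOADINGS_A = [0.917, 0.812, 0.894, 0.983]
LOADINGS_B = [0.920, 0.808, 0.899, 0.985]

r_a = single_factor_reliability(LOADINGS_A)
r_b = single_factor_reliability(LOADINGS_B)
```

Because each term grows without bound as a loading approaches 1, the comprehensibility score, with the highest loading, dominates the estimate.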

Relationship between ratings A and B. Since the data collected up to
this point were for two raters, we could perform an analysis shedding
further light on the nature of the ratings. Specifically, we could conduct
an analysis similar to the one above but incorporating both sets of ratings
for each examinee. The resulting correlation matrix between ratings A and
B appears in Table 7. The interrater correlations on the four scores
appear in parentheses. The correlation is highest for grammar and lowest
for fluency.



Table 7

Intercorrelation Matrix for Ratings A and B, Including the
Factor Loadings and Residuals for a One-Factor Model

         APRON    AGRAM    AFLU     ACOM     BPRON    BGRAM    BFLU     Loading  Residual

APRON                                                                    .873     .070
AGRAM     .726                                                           .838     .064
AFLU      .821     .746                                                  .837     .079
ACOM      .903     .798     .877                                         .901     .081
BPRON    (.741)    .673     .668     .740                                .881     .067
BGRAM     .679    (.827)    .637     .726     .724                       .841     .061
BFLU      .662     .653    (.669)    .680     .829     .746              .849     .075
BCOM      .735     .716     .670    (.753)    .907     .797     .884     .906     .078




A single factor was extracted from this correlation matrix. The
loadings on that single factor appear at the extreme right of the table.
From a statistical point of view, however, a single factor was not
sufficient to account for the correlations, as was evidenced by the highly
significant chi-square statistic (chi-square = 1196.97, df = 20, p <
.0001). More important, the residual off-diagonal correlations were
substantial. The root mean off-diagonal residuals are shown in the far
right-hand column of Table 7. Moreover, there was a specific pattern to
those residuals. Specifically, the largest residuals were ACOM-APRON,
ACOM-AFLU, BCOM-BPRON, and BCOM-BFLU. One possible interpretation of this
pattern is that, although for the most part the two raters shared the same
perspective about proficiency, they seemed to differ somewhat with respect
to how they integrated pronunciation and fluency into the comprehensibility
rating. Indeed, the variability of the differences for the pronunciation
and fluency ratings (see Table 4) is larger than it is for the grammar and
comprehensibility ratings.
Deviations from unidimensionality. The data presented thus far suggest
that unidimensionality is a reasonable model of speaking proficiency as
evaluated by raters. Therefore, an index that examines deviation from
unidimensionality was investigated as a way of predicting instances in
which an examinee would be rated discrepantly by two raters. The index we
used was suggested by Gnanadesikan (1977) in a different context, namely,
the identification of multivariate outliers. The rationale of the index
can best be seen in the bivariate case. Figure 5 shows a scatter plot for
two variables. The first principal component is the line that minimizes
the perpendicular distance of each point to this line; the second
principal component is error. Being high on that component is indicative
of a peculiarity. For example, suppose the two variables under
consideration are height and weight. The first principal component would
be a linear combination of these two variables. If we find subjects high
on the second principal component, chances are that they would be unusually
heavy or light for their height.

The applicability of this rationale to the present application is 
justified by the fact that speaking proficiency seems to be unidimensional. 
Deviations from unidimensionality could thus be viewed as evidence of a 
rater-by-ratee interaction. If that peculiarity can predict discrepancy 
between two raters, we might be justified in using it in a single rater 
system as a quality control mechanism.

The formula for the peculiarity index is given by

\[
d_i^2 = \sum_{j=p-q+1}^{p} \left[ a_j' (y_i - \bar{y}) \right]^2 \qquad (2)
\]

where

y_i is a vector of ratings for the ith examinee, which in this case
consists of four scores,

\bar{y} is the mean vector of ratings over all examinees,

a_j is the jth principal component,

p is the number of variables, four in this case, and

q refers to the last q principal components.
Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of
Multivariate Observations.

Copyright 1977 by Bell Telephone Laboratories, Inc.

Reprinted by permission of John Wiley & Sons, Inc., New York.
Several indices can be computed from this formula. Since we have
presented evidence that unidimensionality is a reasonable model, we
computed the index setting q to 3. That is, the second, third, and fourth
principal components were taken to be error. In this form the index
quantifies, under the assumption of unidimensionality, how dissimilar an
examinee's ratings are from the mean rating obtained by all examinees.

To compute the index it is first necessary to compute the principal 
components. The four principal components for ratings A and B are given in 
Table 8. 



Table 8

Principal Components for the Covariance Matrix
for Rating A and B (N = 560)

                  Rating A                          Rating B
          1       2       3       4         1       2       3       4

PRON     .558   -.495   -.523   -.411      .564   -.483   -.537   -.400
GRAM     .424    .860   -.231   -.165      .417    .866   -.218   -.164
FLU      .502   -.079    .819   -.266      .506   -.089    .814   -.271
COMP     .506   -.096   -.041    .856      .501   -.088   -.035    .860



The standing of each examinee on these three indices was computed by means
of equation 2. A roster was prepared containing the index for each ratee
as well as the ratings provided by two raters and the corresponding
difference. It quickly became apparent that the magnitude of the
differences between two raters could not be predicted by the index. Indeed,
the correlation of the index with the absolute difference between the two
raters on any of the linguistic skills was no larger than .15. In short,
an approach based on a discrepancy index such as the one
proposed by Gnanadesikan (1977) does not appear useful as a means of
predicting disagreement between two raters.
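
The index computation described above can be sketched as follows. The sketch assumes the index is the sum of squared projections of an examinee's deviation from the mean vector onto the last q principal components, which is how the definitions given with equation 2 read. It uses the Rating A components from Table 8 and the Rating A means from Table 3; the function name and the example examinee are hypothetical.

```python
# Rating A principal components (Table 8); each row is one component,
# with coefficients in the order Pron, Gram, Flu, Comp.
COMPONENTS_A = [
    [0.558, 0.424, 0.502, 0.506],     # component 1 (general proficiency)
    [-0.495, 0.860, -0.079, -0.096],  # component 2
    [-0.523, -0.231, 0.819, -0.041],  # component 3
    [-0.411, -0.165, -0.266, 0.856],  # component 4
]

# Rating A means (Table 3), reordered to Pron, Gram, Flu, Comp.
MEANS_A = [1.98, 2.32, 2.12, 2.11]


def peculiarity_index(ratings, q=3, components=COMPONENTS_A, means=MEANS_A):
    """Sum of squared projections onto the last q principal components."""
    deviation = [y - m for y, m in zip(ratings, means)]
    index = 0.0
    for component in components[len(components) - q:]:
        projection = sum(a * d for a, d in zip(component, deviation))
        index += projection ** 2
    return index


# Hypothetical examinee: strong pronunciation but weak grammar.
examinee = [2.48, 2.02, 2.32, 2.21]
index_q3 = peculiarity_index(examinee, q=3)
index_q4 = peculiarity_index(examinee, q=4)
```

With q equal to the number of variables, the index reduces (up to rounding in the tabled coefficients) to the squared distance from the mean vector; with q = 3, the part of that distance lying along the first component is discarded, so an examinee who is uniformly high or low is not flagged.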

Analysis of Raters 

The second possibility we investigated for predicting rater 
disagreement focused on the individual raters. Table 9 shows the number of 
examinees a given rater was assigned and the mean rating of those 
examinees. The table also shows the mean rating for the same examinees 
given by the raters with whom a given rater had been paired. This 
information gives us an indication of the degree of severity applied by 
each rater. Note, however, that since there is no guarantee that examinees 
are assigned at random to raters, the mean rating awarded by a specific 
rater is not necessarily the best indicator of that rater's severity. The
contrast with the mean rating provided by the other raters for the same 
examinees is a better indication of whether a rater has a tendency to 
overrate or underrate examinees. 
Table 9

Means for Each Rater and the Paired Raters on Four Linguistic Dimensions

           Means for each rater              Means for paired raters
ID      Pron    Gram    Flu     Comp      Pron    Gram    Flu     Comp

111      ?       ?       ?       ?         ?      2.31    2.18    2.15
11?      ?       ?       ?       ?         ?      2.37    2.26    2.20
114      ?       ?       ?       ?         ?      2.14    1.99    1.96
118      ?       ?       ?      2.31       ?      2.34    2.08    2.04
120      ?       ?       ?      1.87       ?      2.33    2.14    2.15
121     1.81    2.42    2.17    2.05      2.00    2.37    2.12    2.14
124     1.92    2.37    2.05    2.12      1.95    2.30    2.05    2.07
125     1.77    2.37    1.84    1.96      2.02    2.31    1.82    2.16
126     2.07    2.26    2.07    2.26      2.18    2.31    2.04    2.20
127     2.38    2.32    2.12    2.27      2.03    2.34    2.08    2.23
128     2.10    2.21    2.24    2.21      1.96    2.19    2.10    2.13
129     2.19    2.28    2.38    2.36      2.00    2.31    2.18    2.09
130     1.58    2.08    1.66    1.77      1.88    2.17    2.09    2.17
135     1.97    2.21    2.19    2.22      2.10    2.24    2.25    2.20

Note: A question mark indicates a value that is not legible in the source.



Table 9 clearly shows that some raters tended to give lower ratings, 
and they did so consistently across all four scores. Table 9 also shows 
that of all raters, raters 120 and 130 were the most severe. If an 
examinee were to be assigned to two raters who tend to underrate, it is 
probable that the examinee would receive a lower rating than if assigned to 
a different pair of raters. This also extends to the case where there is 
just one rater. An examinee assigned to a severe rater might receive a 
lower rating than some other rater might give. 

It should be remembered that Table 9 depicts the distribution of TSE
raters according to their severity. As in any distribution, some
individuals fall below the mean and some fall above. The data in
Table 9 show that raters 114, 118, and 129 were more generous than their
colleagues. An examinee assigned to two lenient raters might receive a
higher score than another pair of raters might give. However, the
practical effect of an assignment to two similarly disparate raters is not
large in terms of scaled score points, and the probability of an examinee
being assigned to similarly disparate raters is quite low.

The difference between a rater and his or her paired raters can easily 
be computed from Table 9. Figures 6-9 show the difference for 
pronunciation, grammar, fluency, and comprehensibility, respectively, for 
each rater. A negative difference indicates that the paired raters gave 
the same examinees a higher rating. Again we see that raters 120 and 130 
tended to give the lowest ratings. For comprehensibility, rater 120 tended 
to rate examinees lower by .28 scale points (1.87 vs. 2.15), which is about 
half a standard deviation with respect to the pooled-within-rater 
variability of the comprehensibility rating. Similarly, for 
comprehensibility, raters 118 and 129 tended to rate examinees higher by 
.27 scale points. We also note that the smallest differences among raters 
are on grammar and the largest are on fluency. 
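
The computation described above can be sketched in a few lines. This is only an illustration, not the procedure used in the study: it assumes that each Table 9 row lists the rater's own mean on the four scores followed by the paired raters' means on the same four scores. That column order is an inference from the text (it reproduces the .27 comprehensibility difference reported for rater 129), and only the rows for raters 126 through 135 are transcribed.

```python
# Mean ratings transcribed from the Table 9 rows for raters 126-135.
# ASSUMED column order: the rater's own PRON, GRAM, FLU, COMP means,
# followed by the paired raters' means on the same four scores.
SCORES = ("PRON", "GRAM", "FLU", "COMP")

TABLE_9 = {
    126: (2.07, 2.26, 2.07, 2.26, 2.18, 2.31, 2.04, 2.20),
    127: (2.38, 2.32, 2.12, 2.27, 2.03, 2.34, 2.08, 2.23),
    128: (2.10, 2.21, 2.24, 2.21, 1.96, 2.19, 2.10, 2.13),
    129: (2.19, 2.28, 2.38, 2.36, 2.00, 2.31, 2.18, 2.09),
    130: (1.58, 2.08, 1.66, 1.77, 1.88, 2.17, 2.09, 2.17),
    135: (1.97, 2.21, 2.19, 2.22, 2.10, 2.24, 2.25, 2.20),
}


def mean_differences(row):
    """Return the rater's own mean minus the paired raters' mean per score."""
    own, paired = row[:4], row[4:]
    return {s: round(o - p, 2) for s, o, p in zip(SCORES, own, paired)}


differences = {rater: mean_differences(row) for rater, row in TABLE_9.items()}

for rater, diffs in differences.items():
    # A negative value means the paired raters rated the same examinees higher.
    print(rater, diffs)
```

Under the assumed column order, the negative values for rater 130 on all four scores correspond to the severity noted in the text.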

With an indication of the severity of each rater, we were able to 
examine again the distribution of differences to see if specific raters 
tended to exhibit large differences more frequently. Specifically, we 
examined instances where the difference between raters was more than .95, 
the difference that triggers a third rating under the current system. The 
results appear in Table 10. 

Out of 560 examinees, there were 32 instances of discrepancies greater 
than or equal to .95. Rater 120 was involved most frequently in 
discrepancy cases followed by raters 111, 114, 121, and 129, with counts of 
about 10 each. 
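
As a rough illustration of the rule described above, the following sketch flags the criteria on which two ratings of the same examinee differ enough to call for a third rating. The function names and the example ratings are hypothetical, and the comparison uses "greater than or equal to .95", which is one of the two wordings that appear in the text.

```python
CRITERIA = ("PRON", "GRAM", "FLU", "COMP")

# Difference that triggers a third rating, as stated in the text.
THIRD_RATING_THRESHOLD = 0.95


def discrepant_criteria(first, second, threshold=THIRD_RATING_THRESHOLD):
    """List the criteria on which two ratings differ by at least `threshold`."""
    return [c for c in CRITERIA if abs(first[c] - second[c]) >= threshold]


def needs_third_rating(first, second, threshold=THIRD_RATING_THRESHOLD):
    """True if any criterion shows a discrepancy at or above the threshold."""
    return bool(discrepant_criteria(first, second, threshold))


# Hypothetical ratings for one examinee (not data from the study).
first_rating = {"PRON": 2.0, "GRAM": 2.2, "FLU": 1.3, "COMP": 2.1}
second_rating = {"PRON": 2.1, "GRAM": 2.4, "FLU": 2.4, "COMP": 2.3}

print(discrepant_criteria(first_rating, second_rating))
print(needs_third_rating(first_rating, second_rating))
```

In this made-up case only the fluency ratings differ by more than the threshold, so the tape would go to a third rater.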

Table 10 shows that the largest number of discrepancies greater than 
.95 occurred in the criterion of fluency. Of the 32 examinees involved in 
such discrepancies, the rating assigned to fluency was at issue in 22 of 
them. Having noticed that the largest number of discrepancies occurred on 
this criterion, TSE program staff in November 1983 revised the descriptive 
statements given to raters that accompany each point on the fluency scale. 
Subsequently, staff report a very considerable reduction in the number of 
discrepancies involving fluency. Since this study includes data produced 
as of June 1983, ratings given after this refinement of the fluency scale 
were not analyzed here. 

Prior to November 1983, ratings were assigned by ESL professionals 
living in or near Princeton, New Jersey. However, in November 1983 TSE 
program staff decided to utilize as raters graduate students pursuing a 
master's or doctorate in teaching English as a second language at the 
University of Delaware. TSE program staff report about a two-thirds 
decrease in the number of discrepancies with this new group of raters. It 
is believed that this improvement is due to the fact that members of this 
group share a common academic background (in terms of core courses), in 
addition to their TSE training, and because the members of the group are in 
almost daily contact with each other. Again, this more recent data was not 
included in this study. However, once it is analyzed, it may result in 
further gains in interrater agreement. 

Figure 6 

Mean Difference Between Each Rater 
and the Paired Raters on the Pronunciation Score 

Figure 7 

Mean Difference Between Each Rater 
and the Paired Raters on the Grammar Score 

Figure 8 

Mean Difference Between Each Rater 
and the Paired Raters on the Fluency Score 

Figure 9 

Mean Difference Between Each Rater 
and the Paired Raters on the Comprehensibility Score 

Table 10 



Identification of Pairs of Raters Involved in 
Unusually High Discrepancies 



PRON 


GRAM 


FLU 


COMP 




120-121 


120-121 


120-121 
120-129 




111-114 


120-114 
120-129 
127-111 


120-129 




114-111 


111-120 




111-121 




120-114 
120-114 
120-129 


114-120 


135-121 




120-114 
111-121 
111-121 
114-120 


114-120 


113-118 




120-121 
111-121 
118-121 




128-130 




128-130 
118-113 




128-130 




111-128 




135-121 








118-120 








129-120 








118-120 




118-120 




111-129 




111-129 
114-120 
129-111 





Note: Although examinee response tapes in all of the 
above discrepancies received a third rating before the 
score was reported, only data from the first two ratings 
were included in this study. 

Consistency and validity of individual raters. The question of 
standards is quite separate from that of consistency and validity. That 
is, a rater could consistently give lower ratings and yet give ratings that 
correlate highly with those of other raters. To obtain measures of 
individual raters' validity, we correlated the ratings of individual raters 
with the ratings of the paired raters. The results are shown in Table 11. 



Table 11 

Correlations of Individual Raters with Paired 
Raters on Each Linguistic Dimension 



Rater ID            N     PRON   GRAM   FLU    COMP

111                 93    .75    .77    .61    .81    *
113                141    .72    .81    .62    .68
114                 59    .87    .75    .74    .83    *
118                174    .77    .80    .71    .76
120                151    .79    .85    .75    .82    *
121                119    .82    .82    .70    .77    *
124                 39    .83    .89    .82    .88
125                 13    .74    .87    .88    .90
126                 22    .77    .93    .88    .89
127                 33    .89    .93    .79    .88
128                 75    .84    .91    .86    .92
129                 89    .80    .85    .81    .82
130                 13    .23    .80    .57    .82
135                 75    .87    .90    .87    .91

Median                    .79    .85    .77    .83

Clark & Swinton           .77    .85    .79    .79





Note: The median correlations depicted here 
represent the interrater reliability of a TSE score 
based on a single rating. Official TSE scores are 
based on two or in some cases three ratings. Thus 
the reliability of official TSE scores is 
considerably higher. 
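
The note above does not say how much higher the reliability of a two-rating score would be. One conventional way to estimate it is the Spearman-Brown formula, which projects the reliability of the average of k parallel ratings from the reliability r of a single rating as kr / (1 + (k - 1)r). The sketch below applies it to the median correlations in Table 11; the use of Spearman-Brown here is an assumption made for illustration, not a method reported in the study.

```python
def spearman_brown(single_rating_reliability, n_ratings):
    """Projected reliability of the mean of `n_ratings` parallel ratings."""
    r = single_rating_reliability
    return n_ratings * r / (1 + (n_ratings - 1) * r)


# Median single-rating correlations from Table 11.
MEDIAN_CORRELATIONS = {"PRON": 0.79, "GRAM": 0.85, "FLU": 0.77, "COMP": 0.83}

# ASSUMPTION: the two ratings behave as parallel measurements.
two_rating_reliability = {
    score: round(spearman_brown(r, 2), 2)
    for score, r in MEDIAN_CORRELATIONS.items()
}

print(two_rating_reliability)
```

Under that assumption, the projected two-rating values fall between about .87 and .92.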

There is a substantial range of correlation for each linguistic 
dimension, but the correlations tend to be in the .70s and .80s. The 
median correlation is reported at the bottom of the table along with the 
interrater reliability estimates obtained by Clark and Swinton (1980). It 
cannot be said that a given rater consistently correlates lower, except for 
rater 130, who showed a very low correlation on pronunciation and fluency. 
More important, the five raters identified earlier as generating most of 
the large discrepancies (marked by asterisks) correlate as well with other 
raters as anyone else did. Thus, interrater reliability does not appear to 
determine the likelihood of being involved in a discrepancy. 
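
The medians reported at the bottom of Table 11 can be rechecked directly from the individual rater correlations. The short sketch below does so with the values transcribed from the table; it is offered only as a consistency check on the transcription, and the variable names are illustrative.

```python
from statistics import median

SCORES = ("PRON", "GRAM", "FLU", "COMP")

# Correlations of each rater with the paired raters, transcribed from Table 11.
TABLE_11 = {
    111: (0.75, 0.77, 0.61, 0.81),
    113: (0.72, 0.81, 0.62, 0.68),
    114: (0.87, 0.75, 0.74, 0.83),
    118: (0.77, 0.80, 0.71, 0.76),
    120: (0.79, 0.85, 0.75, 0.82),
    121: (0.82, 0.82, 0.70, 0.77),
    124: (0.83, 0.89, 0.82, 0.88),
    125: (0.74, 0.87, 0.88, 0.90),
    126: (0.77, 0.93, 0.88, 0.89),
    127: (0.89, 0.93, 0.79, 0.88),
    128: (0.84, 0.91, 0.86, 0.92),
    129: (0.80, 0.85, 0.81, 0.82),
    130: (0.23, 0.80, 0.57, 0.82),
    135: (0.87, 0.90, 0.87, 0.91),
}

# Median across the 14 raters for each linguistic dimension.
medians = {
    score: median(row[i] for row in TABLE_11.values())
    for i, score in enumerate(SCORES)
}

print(medians)
```

With 14 raters each median is the average of the two middle values, so pronunciation and comprehensibility come out at .795 and .825, which agree with the reported .79 and .83 to within rounding; grammar and fluency match the reported .85 and .77.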

It is perhaps noteworthy that the data show that the TSE program is 
obtaining improvements in the degree of agreement among raters. The last 
two rows of Table 11 offer a comparison between this data and the 
interrater correlations obtained by Clark and Swinton (1980) in their 
research study. In general, operational data used in this study show a 
greater degree of agreement than that obtained in the earlier study. 
Recently, the program initiated new rating procedures. Two of these 
procedures should result in further gains in interrater agreement. The 
first new procedure involved refining the descriptions utilized by 
raters in assigning ratings for fluency. The second involved an attempt to 
improve communication among raters. 

Summary and Conclusions 

This investigation was undertaken to provide information about the 
feasibility of using one rater instead of the two that are now used for the 
TSE. The results suggest that a single rater system would yield highly 
internally consistent data across scores and that the data could be 
described by a unidimensional model. Working from that result we examined 
the possibility that deviations from unidimensionality could be used as a 
quality-control mechanism to detect instances in which there would be large 
disagreement if a second rater were involved. The approach that was 
investigated could not be validated. 

We then turned to an analysis of individual raters. The data clearly 
showed that at least two of the raters appeared to have considerably 
stricter standards. One of these, rater 130, also had substantially lower 
correlations on two of the four linguistic dimensions, but rated only 
thirteen examinees. Rater 120 was involved in a large number of unusually 
high disagreements; however, this rater correlated as highly with the 
paired rater as did any other rater. 

The foregoing leads to the following conclusion: Because of the 
possibility of different standards among potential raters, it does not 
appear feasible to use a single rater as the sole determiner of speaking 
proficiency at this time. In the remainder of this section two possible 
alternatives, consistent with the original motivation for the study, will 
be discussed. One of these possibilities is psychometric; the other is 
technological. 

One possible solution to the problem of different standards among 
raters is to exclude from the pool those raters who are too severe or too 
lenient. A more psychometrically oriented solution is to view raters as 
test forms and to equate them, much as test forms are equated to control 
for differences in the difficulty of test forms. Although the author is 
not aware of any testing program that equates raters, the idea has at 
least been discussed (de Gruijter, 1980; Pilliner, 1958). Such a 
psychometric solution would probably require a specialized data collection 
design. Nevertheless, this study has shown that if we view raters as test 
forms, there is reason to believe that their ratings are sufficiently 
reliable and valid, in the sense of rank ordering examinees in the same 
fashion as the other raters. Therefore, the idea of equating raters seems 
feasible from a psychometric point of view. 
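
To make the idea of equating raters concrete, the sketch below applies mean equating, the simplest form of test-form equating: a rating is shifted by the difference between the reference mean and the rater's own mean. The study does not specify an equating method, so this choice is an assumption, as is the use of the paired raters' mean as the reference. The means are those transcribed from the Table 9 row for rater 130, and the example rating of 1.5 is hypothetical.

```python
def mean_equate(rating, rater_mean, reference_mean):
    """Shift a rating so the rater's mean lines up with the reference mean."""
    return rating - rater_mean + reference_mean


# Comprehensibility means for rater 130, as transcribed from Table 9:
# the rater's own mean and the paired raters' mean on the same examinees.
RATER_130_COMP_MEAN = 1.77
PAIRED_COMP_MEAN = 2.17

# Hypothetical comprehensibility rating given by rater 130.
raw_rating = 1.5
adjusted_rating = round(
    mean_equate(raw_rating, RATER_130_COMP_MEAN, PAIRED_COMP_MEAN), 2
)

print(adjusted_rating)
```

Mean equating corrects only for a constant difference in severity. An operational procedure would presumably also need to consider differences in spread and the small number of examinees rated by some raters, which is consistent with the remark that a specialized data collection design would be required.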

The second possibility is to use multiple raters, each of whom would be 
asked to rate only a part of the examinees' performance. That is, different 
sections of the test could be assigned to different raters. Since the 
examinees' performance is on tape, it would seem that technological help 
would be needed. To implement this idea, a system would have to be 
developed to efficiently create tapes containing portions of examinees' 
performance. A tape containing the same segment of performance from 
several examinees would be sent to each rater, who in effect would rate 
only part of the entire test. The rater would return as many scoring 
sheets as there were examinees on the tape, but would complete only some 
parts of the form. That information would, of course, have to be entered 
into a computer, which would pull together the ratings from several raters 
to produce a reportable score. 

The purely psychometric solution of equating raters is likely to be 
less expensive. A disadvantage is that the evaluation of an examinee's 
performance would be based solely on the judgment of a single rater. Even 
after equating raters there is a possibility that a peculiar rater-examinee 
interaction could have an effect on the resulting score. By contrast, the 
second possibility, by involving several raters, would control not only for 
the different standards that raters might have but also for any possible 
rater-examinee interaction. 

Whatever system is ultimately adopted, the potential vulnerability of 
individual raters to different criteria should be borne in mind. The 
present system, even though it uses two raters, is not free from the 
problem. The results of this investigation suggest that it is imperative 
to monitor individual raters on a regular basis. An operational system of 
monitoring, followed by immediate recalibration when necessary, has the 
potential to maintain rater reliability at a uniformly high level, as well 
as uniform standards across raters. Such monitoring could also eventually 
allow the use of a single rater. 

Loadings and residuals for a one-factor model estimated for each rater 

                          Loadings                       Residuals
  N   Rater ID    Pron.  Gram.  Flu.   Comp.     Pron.  Gram.  Flu.   Comp.

 93     111       Communalities > 1.0*
141     113       .90    .75    .79    .92       .02    .08    .08    .02
 59     114       Communalities > 1.0*
174     118       .85    .88    .84    .99       .03    .02    .03    .00
151     120       .95    .91    .96    .99       .02    .02    .01    .00
119     121       .96    .78    .91    .98       .03    .04    .03    .01
 39     124       .90    .92    .89    .99       .02    .04    .03    .00
 13     125       .97    .84    .97    .98       .01    .04    .03    .02
 22     126       .98    .94    .90    .99       .00    .01    .01    .01
 33     127       .98    .95    .95    .91       .01    .02    .03    .03
 75     128       Communalities > 1.0*
 89     129       .92    .86    .91    .99       .02    .02    .02    .00
 13     130       .74    .82    .94    .99       .11    .12    .02    .01
 75     135       .93    .92    .96    .98       .01    .01    .01    .00

*It was not possible to estimate the parameters of the factor model for 
raters where one or more of the communalities were greater than 1. 